Polarity Classification of Bitcoin, Netflix and Twitter Data using Sentiment Analysis

1.Introduction

Bitcoin is a digital currency and payment system that is entirely decentralized, which means it relies on peer-to-peer transactions with no bureaucratic oversight. Bitcoin offers a novel opportunity for forecasting due to its relatively young age and resulting volatility. Netflix, on the other hand, is an American entertainment company founded by Reed Hastings and Marc Randolph on August 29, 1997, in Scotts Valley, California. It provides streaming media, online video on demand, and DVD by mail. In 2013, Netflix expanded into film and television production as well as online distribution. This project is divided into 3 parts. First, finalize a model: train a final model using train/test splits before making any prediction. Second, classification and regression predictions: apply supervised learning algorithms in which, given input variables, the model learns a mapping to the output quantity, which in this case is the daily price change (UP or DOWN). Third, sentiment analysis using Twitter data: automatically determine what feelings a writer is expressing in a text, since those feelings may affect the price of Bitcoin and Netflix.
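The UP/DOWN target described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the project; the function name, the sample prices and the rule that a flat day counts as DOWN are all assumptions made for the example.

```python
def label_price_changes(closes):
    """Label each day after the first as 1 (UP) or -1 (DOWN)."""
    labels = []
    for previous, current in zip(closes, closes[1:]):
        # Assumption: a day with no change is counted as DOWN.
        labels.append(1 if current > previous else -1)
    return labels


# Hypothetical daily closing prices, not taken from the project data.
closes = [100.0, 102.5, 101.0, 101.0, 105.2]
labels = label_price_changes(closes)
print(labels)
```

The 1/-1 coding matches the class labels that appear in the confusion matrix later in the report.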

Why Netflix and Bitcoin:

-Netflix: it added far more users than expected in the first quarter of 2018. Netflix’s addition of more than 7.4 million international subscribers set a new record, marking growth of 50 percent from a year earlier.

-Bitcoin: only four million bitcoins are left to be mined out of 21 million. About every ten minutes, a new Bitcoin block is created by Bitcoin miners, a subgroup of the people running the computer nodes that keep Bitcoin operational.

id            name          symbol  rank  price_usd  price_btc   X24h_volume_usd  market_cap_usd
bitcoin       Bitcoin       BTC     1     9258.58    1.0         7969100000.0     157507999921
ethereum      Ethereum      ETH     2     718.103    0.0777043   3077020000.0     71232592599.0
ripple        Ripple        XRP     3     0.866715   0.00009379  582677000.0      33935230007.0
bitcoin-cash  Bitcoin Cash  BCH     4     1506.11    0.162972    1380980000.0     25764854333.0
eos           EOS           EOS     5     18.9222    0.00204753  1701250000.0     15722946591.0
cardano       Cardano       ADA     6     0.38785    0.00004197  308483000.0      10055814308.0

The above visualization shows that the whole cryptocurrency market is propped up primarily by one currency, Bitcoin, which is the driving factor of this market. It is also fascinating (and shocking at the same time) that Bitcoin can create a market worth more than 100 billion US dollars. Whether or not this is a sign of a bubble, we will leave for market analysts to speculate. As data scientists or analysts, we have many insights to extract from the above data, and analysing such an expensive market should be interesting.
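To put a number on Bitcoin's dominance, the market capitalizations from the table can be turned into shares. The Python sketch below is only an illustration (the project itself was written in R), and the shares are relative to the six listed coins, not to the whole market.

```python
# Market capitalizations (USD) copied from the table above.
market_caps_usd = {
    "Bitcoin": 157507999921.0,
    "Ethereum": 71232592599.0,
    "Ripple": 33935230007.0,
    "Bitcoin Cash": 25764854333.0,
    "EOS": 15722946591.0,
    "Cardano": 10055814308.0,
}

total = sum(market_caps_usd.values())
# Share of each coin among the six listed coins only.
shares = {name: cap / total for name, cap in market_caps_usd.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<13}{share:6.1%}")
```

Among these six coins, Bitcoin alone accounts for roughly half of the combined market capitalization.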

2.Packages Required: rvest, httr, curl, jsonlite, tidytext, tidyverse, stringr, magrittr, knitr, printr, twitteR, ROAuth, bitops, RCurl, NLP, tm, ggplot2, ggmap, plyr, dplyr, RColorBrewer, wordcloud, ggcorrplot, nnet, kernlab, jpeg, randomForest, foreign, reshape2, tidyr, pls, plotly, coinmarketcapr, treemap.

3.Data Preparation

Data history: Bitcoin data from coin market for the past 5 years, Netflix historical prices from finance.yahoo for the past 5 years, and tweets about Bitcoin and Netflix from Twitter (7000 tweets for each). I collected four data sets. The Netflix data was almost clean to begin with: I only fixed a few variables, such as the date, which was changed to Unix time, and used regular expressions to clean the data. The Bitcoin data, on the other hand, was very difficult to work with, because it contained per-minute information with many missing values and many meaningless variables, so I spent more time cleaning it. I first wrote for loops to calculate the daily average of each variable and then created new variables from these daily values, which turned the data into a daily data set. The reason I did this was to avoid overfitting the data with multicollinearity. The last two data sets are for the sentiment analysis of tweets from Twitter. I extracted 4000 tweets each for Bitcoin and Netflix using a Twitter API key and cleaned the data, which was even more difficult to clean than the Bitcoin data. After all the cleaning, I saved the final data sets and uploaded them to GitHub. Anyone interested in using the data sets for further analysis can download them directly from this link: https://github.com/selecta21/Project_Boubacar
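The per-minute to daily aggregation described above can be sketched as follows. This is a Python illustration under stated assumptions, not the project's R loops: it assumes each record is a (Unix timestamp in seconds, price) pair, that missing prices are stored as None, and that days are split on UTC boundaries. The sample rows are invented.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_means(rows):
    """Average per-minute prices into one value per UTC calendar day.

    rows: iterable of (unix_timestamp_seconds, price_or_None) pairs.
    """
    buckets = defaultdict(list)
    for timestamp, price in rows:
        if price is None:  # skip missing observations
            continue
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        buckets[day].append(price)
    return {day: sum(values) / len(values) for day, values in sorted(buckets.items())}


# Hypothetical per-minute records with one missing value.
rows = [
    (1525132800, 9200.0),  # 2018-05-01 00:00:00 UTC
    (1525132860, None),    # missing price, ignored
    (1525132920, 9300.0),
    (1525219200, 9100.0),  # 2018-05-02 00:00:00 UTC
]
daily = daily_means(rows)
for day, mean_price in daily.items():
    print(day, mean_price)
```

Dropping missing values before averaging is one possible choice; imputing them would be another and would change the daily averages.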

4.Exploratory Data Analysis

4.1 Correlation between variables

As we can see from the above correlograms, some variables in our data frames are highly correlated. Since this is a multicollinearity issue, we run principal component regression (PCR), which is designed to handle it.
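A correlogram is built from pairwise correlation coefficients, so the check for highly correlated variables can be sketched directly. The Python code below is an illustration only (the project used R); the column names, the sample values and the 0.9 threshold are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def highly_correlated(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical columns; open and close move together, volume does not.
columns = {
    "open": [10.0, 11.0, 12.0, 13.0, 14.0],
    "close": [10.5, 11.2, 12.4, 13.1, 14.3],
    "volume": [5.0, 3.0, 6.0, 2.0, 4.0],
}
flagged = highly_correlated(columns)
print(flagged)
```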

Advantages of performing PCR:

-Dimensionality reduction: reducing the model complexity.

-Avoidance of multicollinearity between predictors: this is a significant benefit for our data, because there is some degree of multicollinearity between the variables. PCR avoids the problem because performing PCA on the raw data produces linear combinations of the predictors that are uncorrelated.

-Overfitting mitigation: assuming the assumptions underlying PCR hold, we fitted a least squares model to the principal components and got better results than when we fitted a least squares model to the original data. Most of the variation and information related to the dependent variable is condensed in the principal components, and by estimating fewer coefficients we reduced the risk of overfitting.
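The claim that PCA produces uncorrelated linear combinations can be checked on a small example. The Python sketch below handles only two predictors, using the closed-form eigen decomposition of their 2x2 covariance matrix; it is an illustration with invented data, not the PCR fit from the project, and it assumes the two predictors have non-zero covariance.

```python
from math import sqrt


def pca_2d(x, y):
    """PCA for two variables. Returns ((lam1, lam2), (pc1, pc2))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    a = sum(v * v for v in xc) / (n - 1)              # var(x)
    c = sum(v * v for v in yc) / (n - 1)              # var(y)
    b = sum(p * q for p, q in zip(xc, yc)) / (n - 1)  # cov(x, y)
    half_gap = sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1 = (a + c) / 2 + half_gap  # larger eigenvalue
    lam2 = (a + c) / 2 - half_gap  # smaller eigenvalue
    # Eigenvector for lam1 is proportional to (b, lam1 - a); assumes b != 0.
    norm = sqrt(b ** 2 + (lam1 - a) ** 2)
    v1 = (b / norm, (lam1 - a) / norm)
    v2 = (-v1[1], v1[0])  # orthogonal direction, eigenvector for lam2
    pc1 = [p * v1[0] + q * v1[1] for p, q in zip(xc, yc)]
    pc2 = [p * v2[0] + q * v2[1] for p, q in zip(xc, yc)]
    return (lam1, lam2), (pc1, pc2)


# Hypothetical, strongly correlated predictors.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
(lam1, lam2), (pc1, pc2) = pca_2d(x, y)

explained = lam1 / (lam1 + lam2)                   # share of variance on PC1
cross_product = sum(p * q for p, q in zip(pc1, pc2))  # ~0 if uncorrelated
print(round(explained, 4), abs(cross_product) < 1e-9)
```

Although x and y are almost perfectly correlated, the two component scores have a zero cross-product, which is why a regression on the components does not suffer from multicollinearity.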

4.2 PCA analysis
## Data:    X dimension: 1173 6 
##  Y dimension: 1173 1
## Fit method: svdpc
## Number of components considered: 6
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.9964   0.9965   0.9927   0.9924   0.9946   0.9961   0.9949
## adjCV       0.9964   0.9964   0.9926   0.9922   0.9943   0.9957   0.9944
## 
## TRAINING: % variance explained
##      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X   88.87729   98.519   99.934   99.978   99.996  100.000
## gb   0.09789    1.028    1.239    1.425    1.431    1.734
## Data:    X dimension: 944 5 
##  Y dimension: 944 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
## CV               1    1.002    1.004    1.005    1.001    1.002
## adjCV            1    1.002    1.004    1.004    1.000    1.001
## 
## TRAINING: % variance explained
##      1 comps  2 comps  3 comps  4 comps  5 comps
## X   79.84571  97.1868  99.9830  99.9931   100.00
## gn   0.07837   0.1009   0.2412   0.9976     1.08

As you can see, two main results are printed, namely the validation error and the cumulative percentage of variance explained using n components.

Lower RMSEP values indicate a better fit. Among the fitted components, the cross-validated RMSEP is lowest at 3 components for the Bitcoin data (0.9924) and at 4 components for the Netflix data (1.001). However, 2 components are already enough to explain most of the variability in the predictors of each data set (98.5% for Bitcoin and 97.2% for Netflix).
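The component counts discussed here can be read off the printed output programmatically. The Python sketch below copies the CV and X-variance rows from the output above (leaving out the intercept-only column); the helper names and the 95% variance threshold are assumptions made for illustration.

```python
# Cross-validated RMSEP for 1..n components, copied from the output above.
bitcoin_cv = [0.9965, 0.9927, 0.9924, 0.9946, 0.9961, 0.9949]
netflix_cv = [1.002, 1.004, 1.005, 1.001, 1.002]

# Cumulative % of X variance explained for 1..n components.
bitcoin_x_var = [88.87729, 98.519, 99.934, 99.978, 99.996, 100.000]
netflix_x_var = [79.84571, 97.1868, 99.9830, 99.9931, 100.00]


def best_by_rmsep(cv_errors):
    """1-based number of components with the lowest cross-validated error."""
    return min(range(len(cv_errors)), key=cv_errors.__getitem__) + 1


def first_reaching(cumulative_pct, threshold=95.0):
    """Smallest number of components whose cumulative variance >= threshold."""
    for n_components, pct in enumerate(cumulative_pct, start=1):
        if pct >= threshold:
            return n_components
    return len(cumulative_pct)


print(best_by_rmsep(bitcoin_cv), best_by_rmsep(netflix_cv))
print(first_reaching(bitcoin_x_var), first_reaching(netflix_x_var))
```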

As we can see from our PCA projection, our Netflix data (as well as the Bitcoin data, which is not shown on the plot) is not linearly separable. We therefore apply Kernel PCA, a technique for data that is not linearly separable. Kernel PCA is an extension of principal component analysis (PCA) that uses kernel methods.
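Kernel PCA replaces the covariance matrix of ordinary PCA with a centered kernel matrix. The Python sketch below shows only that first step (building and centering the kernel matrix) and leaves out the eigen decomposition that follows. The RBF kernel, the sigma value and the sample points are assumptions; the report does not state which kernel or parameters were used.

```python
from math import exp


def rbf_kernel_matrix(points, sigma=0.5):
    """K[i][j] = exp(-sigma * squared distance between points i and j)."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [[exp(-sigma * squared_distance(p, q)) for q in points] for p in points]


def center_kernel(K):
    """Center K in feature space: K - 1K - K1 + 1K1, with 1 = ones / n."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    grand_mean = sum(row_means) / n
    # K is symmetric, so column means equal row means.
    return [
        [K[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 2-D observations.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K = rbf_kernel_matrix(points)
K_centered = center_kernel(K)
print([round(sum(row), 12) for row in K_centered])
```

The leading eigenvectors of the centered matrix then give the kernel principal components, which can be used as predictors in a classifier.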

4.3 Applying Kernel PCA (Bitcoin data)
## 
## Call:
## glm(formula = y ~ ., family = binomial, data = B_pre)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.432  -1.225   0.958   1.102   1.246  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.192312   0.059036   3.258  0.00112 **
## V1          -0.009211   0.004425  -2.082  0.03737 * 
## V2           0.017522   0.005540   3.163  0.00156 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1615.6  on 1172  degrees of freedom
## Residual deviance: 1601.2  on 1170  degrees of freedom
## AIC: 1607.2
## 
## Number of Fisher Scoring iterations: 4
##     y.pred
##       -1   1
##   -1  61 129
##   1   40 161

With a test error of 43%, our classifier does better than random guessing, despite the difficulty of predicting price changes for Bitcoin.
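The 43% figure follows directly from the confusion matrix printed above. The Python sketch below redoes that arithmetic; it assumes the rows of the matrix are the actual classes and the columns are the predictions, and the comparison with an "always predict UP" baseline is an added illustration that depends on that assumption.

```python
# Confusion matrix from the output above, keyed by (actual, predicted).
# Assumption: rows are actual classes, columns are predicted classes.
confusion = {
    (-1, -1): 61, (-1, 1): 129,
    (1, -1): 40, (1, 1): 161,
}

total = sum(confusion.values())
correct = sum(
    count for (actual, predicted), count in confusion.items() if actual == predicted
)
test_error = 1 - correct / total

# Baseline: always predict UP (1); wrong whenever the actual class is -1.
actual_down = sum(count for (actual, _), count in confusion.items() if actual == -1)
baseline_error = actual_down / total

print(f"test error: {test_error:.1%}, always-UP baseline: {baseline_error:.1%}")
```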

4.4 Random Forest for bitcoin

We also applied the Random Forest algorithm. With enough trees in the forest, the classifier should not overfit, and highly correlated variables do not cause multicollinearity issues when the model chooses its splits. This matters here because we are predicting the price change, and the two classes of the target variable (price up and price down) are almost impossible to separate.

4.5 Applying Kernel PCA (Netflix)
## 
## Call:
## glm(formula = y ~ ., family = binomial, data = B_pre)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.295  -1.199   1.075   1.154   1.175  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.080690   0.065189   1.238    0.216
## V1           0.002816   0.004641   0.607    0.544
## V2          -0.006143   0.006726  -0.913    0.361
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1307.1  on 943  degrees of freedom
## Residual deviance: 1305.9  on 941  degrees of freedom
## AIC: 1311.9
## 
## Number of Fisher Scoring iterations: 3

With a test error of 42%, our classifier does better than random guessing, despite the difficulty of predicting price changes for Netflix stock.

4.6 Random Forest for Netflix

The same reasoning applies as for the Bitcoin random forest, except that the Netflix model overfits slightly more.

4.7 Polarity classification

Since the rise of social media, a large part of current research has focused on classifying natural language as expressing either positive or negative sentiment. Polarity classification has been found to achieve high accuracy in predicting changes or trends in public sentiment across a myriad of domains (e.g. stock price prediction, Bitcoin price change).

##  [1] "a"              "and"            "are"            "at"            
##  [5] "be"             "bitcoin"        "blockchain"     "brother"       
##  [9] "btc"            "by"             "cash"           "coin"          
## [13] "crypto"         "cryptocurrency" "do"             "escobars"      
## [17] "ethereum"       "for"            "free"           "get"           
## [21] "has"            "how"            "i"              "ico"           
## [25] "in"             "is"             "it"             "just"          
## [29] "like"           "more"           "murthaburke"    "n"             
## [33] "nblockchain"    "new"            "now"            "of"            
## [37] "on"             "only"           "our"            "out"           
## [41] "pablo"          "per"            "price"          "rt"            
## [45] "than"           "that"           "the"            "this"          
## [49] "to"             "up"             "we"             "what"          
## [53] "will"           "with"           "you"            "your"

Those are the most frequent words used in bitcoin tweets.

As we can see in the above picture, bitcoin is the most used word after removing stop words.

Below are the most frequent words after removing stop words
##   [1] "a"               "about"           "airdrop"        
##   [4] "all"             "amp"             "an"             
##   [7] "and"             "are"             "as"             
##  [10] "at"              "bank"            "be"             
##  [13] "bethereumteam"   "bitcoin"         "blockchain"     
##  [16] "brother"         "btc"             "but"            
##  [19] "buy"             "by"              "can"            
##  [22] "cash"            "coin"            "copy"           
##  [25] "crypto"          "cryptocurrency"  "currency"       
##  [28] "days"            "de"              "deposit"        
##  [31] "dietbitcoinfork" "do"              "dont"           
##  [34] "escobarinc"      "escobars"        "eth"            
##  [37] "ethereum"        "exchange"        "fee"            
##  [40] "for"             "free"            "from"           
##  [43] "get"             "has"             "have"           
##  [46] "here"            "how"             "i"              
##  [49] "ico"             "if"              "in"             
##  [52] "is"              "it"              "its"            
##  [55] "join"            "just"            "know"           
##  [58] "like"            "litecoin"        "make"           
##  [61] "market"          "may"             "me"             
##  [64] "million"         "miss"            "money"          
##  [67] "more"            "murthaburke"     "n"              
##  [70] "nblockchain"     "new"             "news"           
##  [73] "next"            "not"             "now"            
##  [76] "of"              "on"              "only"           
##  [79] "or"              "our"             "out"            
##  [82] "over"            "pablo"           "people"         
##  [85] "per"             "post"            "price"          
##  [88] "retweet"         "rightbtc"        "rt"             
##  [91] "running"         "says"            "some"           
##  [94] "telegram"        "than"            "that"           
##  [97] "the"             "this"            "to"             
## [100] "tokens"          "trading"         "tron"           
## [103] "trx"             "up"              "us"             
## [106] "use"             "via"             "we"             
## [109] "what"            "why"             "will"           
## [112] "with"            "worth"           "you"            
## [115] "your"

Bitcoin appears to be the most used word and rt the second most used, probably because rt is an abbreviation of both retweet and real time: many retweets occurred, and people follow the coin chart in real time.
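The word counts behind these observations can be sketched as follows. This is a Python illustration rather than the project's R text-mining pipeline; the stop-word list is a tiny hypothetical stand-in, the sample tweets are invented, and the letters-only cleaning rule is an assumption that would also strip accented characters such as those in the Netflix tweets.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real list is much longer.
STOP_WORDS = {"a", "and", "the", "is", "to", "of", "in", "it", "for", "on"}


def tokenize(tweet):
    """Lower-case a tweet, drop URLs, keep ASCII letters only, split on spaces."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits, punctuation, accents
    return text.split()


def word_frequencies(tweets, stop_words=STOP_WORDS):
    """Count words across all tweets, skipping stop words."""
    counts = Counter()
    for tweet in tweets:
        counts.update(word for word in tokenize(tweet) if word not in stop_words)
    return counts


# Invented example tweets.
tweets = [
    "RT @user: Bitcoin price is up! https://example.com/x",
    "Bitcoin and Ethereum: the crypto market in 2018",
    "rt this if you like bitcoin",
]
freq = word_frequencies(tweets)
print(freq.most_common(2))
```

Because rt survives the cleaning step as an ordinary token, it ranks right behind the search term itself, which is the pattern described above.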

##  [1] "a"             "ai"            "americana"     "and"          
##  [5] "apenas"        "as"            "bicho"         "brasileiro"   
##  [9] "chega"         "começo"        "como"          "confirmado"   
## [13] "da"            "das"           "de"            "dia"          
## [17] "e"             "el"            "elle"          "en"           
## [21] "eram"          "essa"          "estreia"       "eu"           
## [25] "fitas"         "for"           "gt"            "harryegirl"   
## [29] "hugogloss"     "i"             "im"            "in"           
## [33] "injustiçado"   "is"            "it"            "its"          
## [37] "kkk"           "la"            "le"            "lord"         
## [41] "lordnfarquaad" "lt"            "m"             "maio"         
## [45] "may"           "mayo"          "me"            "minha"        
## [49] "n"             "na"            "nearquaad"     "netflix"      
## [53] "nexo"          "no"            "nossa"         "o"            
## [57] "of"            "olha"          "on"            "para"         
## [61] "pas"           "plataforma"    "pra"           "que"          
## [65] "reasons"       "reasonswhy"    "reasonswhybra" "rt"           
## [69] "season"        "segunda"       "sehun"         "sempre"       
## [73] "senhora"       "serie"         "streaming"     "série"        
## [77] "temporada"     "the"           "this"          "to"           
## [81] "um"            "watch"         "weareoneexo"   "why"          
## [85] "y"             "you"           "ª"

The words above occurred more than 200 times in the Netflix tweets.

As we can see from the above plot, the word netflix has the highest frequency among all words mentioned in the tweets, which makes sense because the tweets are about Netflix stock.

##   [1] "a"                            "ai"                          
##   [3] "all"                          "americana"                   
##   [5] "amp"                          "and"                         
##   [7] "apenas"                       "are"                         
##   [9] "as"                           "at"                          
##  [11] "be"                           "bicho"                       
##  [13] "bossartkilian"                "brasileiro"                  
##  [15] "busted"                       "but"                         
##  [17] "cant"                         "cest"                        
##  [19] "cette"                        "chega"                       
##  [21] "começo"                       "como"                        
##  [23] "confirmado"                   "d"                           
##  [25] "da"                           "das"                         
##  [27] "date"                         "de"                          
##  [29] "dia"                          "e"                           
##  [31] "el"                           "elle"                        
##  [33] "en"                           "eram"                        
##  [35] "es"                           "essa"                        
##  [37] "esse"                         "estreia"                     
##  [39] "estreno"                      "eu"                          
##  [41] "first"                        "fitas"                       
##  [43] "for"                          "grandmère"                   
##  [45] "gt"                           "h"                           
##  [47] "harryegirl"                   "hugogloss"                   
##  [49] "i"                            "im"                          
##  [51] "in"                           "injustiçado"                 
##  [53] "is"                           "it"                          
##  [55] "its"                          "jusquà"                      
##  [57] "just"                         "kkk"                         
##  [59] "la"                           "lanchennetflixntransar"      
##  [61] "las"                          "le"                          
##  [63] "longtemps"                    "lord"                        
##  [65] "lordnfarquaad"                "lt"                          
##  [67] "m"                            "ma"                          
##  [69] "mai"                          "maio"                        
##  [71] "may"                          "mayo"                        
##  [73] "me"                           "minha"                       
##  [75] "my"                           "n"                           
##  [77] "na"                           "nearquaad"                   
##  [79] "netflix"                      "netflixnnubdentro"           
##  [81] "new"                          "nexo"                        
##  [83] "no"                           "nossa"                       
##  [85] "now"                          "nuit"                        
##  [87] "o"                            "of"                          
##  [89] "olha"                         "on"                          
##  [91] "original"                     "out"                         
##  [93] "para"                         "pas"                         
##  [95] "pc"                           "plataforma"                  
##  [97] "por"                          "pra"                         
##  [99] "previouslyserie"              "ptdrrrrr"                    
## [101] "q"                            "que"                         
## [103] "queria"                       "reasons"                     
## [105] "reasonswhy"                   "reasonswhybra"               
## [107] "reasonswhynnuaatenciónuannel" "regardé"                     
## [109] "rt"                           "saison"                      
## [111] "se"                           "season"                      
## [113] "segunda"                      "sehun"                       
## [115] "sempre"                       "senhora"                     
## [117] "serie"                        "series"                      
## [119] "show"                         "so"                          
## [121] "streaming"                    "sur"                         
## [123] "série"                        "só"                          
## [125] "t"                            "temporada"                   
## [127] "that"                         "the"                         
## [129] "this"                         "to"                          
## [131] "um"                           "un"                          
## [133] "une"                          "vc"                          
## [135] "ver"                          "videoubn"                    
## [137] "watch"                        "we"                          
## [139] "weareoneexo"                  "why"                         
## [141] "with"                         "y"                           
## [143] "ya"                           "yo"                          
## [145] "you"                          "ª"                           
## [147] "é"

Most frequent words after removing stop words

As with Bitcoin, netflix appears to be the most used word and rt the second most used, probably because rt is an abbreviation of both retweet and real time: many retweets occurred, and people follow the stock price chart in real time.

word         negative  positive  sentiment
absurd       1         0         -1
abuse        1         0         -1
accurate     0         2         2
achievable   0         1         1
achievement  0         1         1
advantage    0         1         1

As we can see from the sentiment contribution, people tend to have more negative feelings when they miss a chance to trade profitably. On the other hand, people are more positive when they are making a profit: they like it more and reach the highest level of positive sentiment. The same applies to the Netflix data.
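The table above has one row per word with negative and positive counts and a sentiment column equal to positive minus negative. The Python sketch below reproduces that layout for illustration only. The lexicon contains just the six words shown in the table, with polarities inferred from its columns, and the input word list is invented; the project itself worked in R with a full sentiment lexicon.

```python
from collections import Counter

# Hypothetical mini-lexicon: only the six words from the table above.
LEXICON = {
    "absurd": "negative",
    "abuse": "negative",
    "accurate": "positive",
    "achievable": "positive",
    "achievement": "positive",
    "advantage": "positive",
}


def sentiment_table(words, lexicon=LEXICON):
    """Per-word negative/positive counts and sentiment = positive - negative."""
    counts = Counter(word for word in words if word in lexicon)
    table = {}
    for word in sorted(counts):
        negative = counts[word] if lexicon[word] == "negative" else 0
        positive = counts[word] if lexicon[word] == "positive" else 0
        table[word] = {
            "negative": negative,
            "positive": positive,
            "sentiment": positive - negative,
        }
    return table


# Invented token stream; words outside the lexicon are ignored.
words = ["accurate", "bitcoin", "absurd", "accurate", "advantage", "price"]
table = sentiment_table(words)
net_sentiment = sum(row["sentiment"] for row in table.values())

for word, row in table.items():
    print(word, row["negative"], row["positive"], row["sentiment"])
print("net sentiment:", net_sentiment)
```

Summing the sentiment column gives a single net score for a set of tweets, which is one simple way to compare the overall polarity of two collections.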

5.Summary

This project tried to shed light on the factors that help classify price changes of Bitcoin and Netflix stock in the short run as well as in the long run. We built an empirical model incorporating Kernel PCA and logistic regression, and we also extended the existing literature by taking Twitter sentiment into account. Specifically, we used sentiment analysis to measure the sentiment ratio of Twitter users concerning Bitcoin and Netflix stock on a daily basis. After dealing with the issues of extracting and cleaning tweets, we estimated that our Twitter sentiment ratio has a positive short-run impact on the Bitcoin price as well as the Netflix stock price. Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts. Bitcoin offers a novel opportunity for forecasting due to its relatively young age and resulting volatility, but fluctuations in a market are difficult to predict.

References:
https://en.wikipedia.org/wiki/Netflix
https://www.cnbc.com/2018/04/16/netflix-earnings-q1-2018.html
https://www.ccn.com/only-another-four-million-bitcoin-will-be-mined-heres-why/
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
http://www.milanor.net/blog/performing-principal-components-regression-pcr-in-r/
https://www.tidytextmining.com/sentiment.html
https://www.r-bloggers.com/analysing-cryptocurrency-market-in-r/